Machine Learning Exam

Exercise 1

What is the dimensionality of the samples in the design matrix?

The exercise tells us that the columns are the samples, so each sample has as many dimensions as the matrix has rows: 2.
There are 5 samples (one per column).

Complete μ (empirical average) and Σ (covariance matrix)

The mean vector is composed of the averages along each dimension.

We compute the covariance matrix in the following way:
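A standard way to write both quantities (assuming the N samples x_i are the columns of the design matrix):

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
\Sigma = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^{\top}
```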

Hint

There are a few methods for matrix multiplication:

Basis vectors method

Remember that the columns of the matrix on the left are the images of the basis vectors.
So, in each column of the matrix on the right, you substitute the basis vectors with the corresponding columns of the left matrix.

Dot product method

You compute the dot product between the rows of the first matrix and the columns of the second matrix.
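Both methods can be sketched in a few lines of NumPy (the matrices here are hypothetical, just for illustration):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

# Dot-product method: entry (i, j) is the dot product of
# row i of A with column j of B.
C = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        C[i, j] = A[i, :] @ B[:, j]

# Basis-vector method: each column of A @ B is A applied to the
# corresponding column of B (B's columns re-expressed through A's basis images).
C2 = np.column_stack([A @ B[:, j] for j in range(2)])

assert np.allclose(C, A @ B) and np.allclose(C2, A @ B)
```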



How do we compute a determinant?

It might be in the exam.
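For the 2x2 case, which is the most likely one in the exam, the determinant has a closed form that is easy to check against NumPy:

```python
import numpy as np

# For a 2x2 matrix [[a, b], [c, d]], the determinant is a*d - b*c.
M = np.array([[3.0, 1.0],
              [2.0, 4.0]])
det_by_hand = M[0, 0] * M[1, 1] - M[0, 1] * M[1, 0]  # 3*4 - 1*2 = 10
assert np.isclose(det_by_hand, np.linalg.det(M))
```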

Describe the process that applies those transformations

  1. We center the point cloud by subtracting the mean from all the points.
  2. We get the covariance matrix of the point cloud, which can be computed with
  3. Using PCA/Spectral decomposition, we get the eigenvectors of the covariance matrix.
  4. We apply the transposed eigenvector matrix, so as to bring the eigenvectors onto our basis vectors.
  5. Now we swap the two eigenvectors/principal components.
  6. We apply the new eigenvector matrix, so that the basis vectors end up on the new eigenvectors.
  7. We add the mean back to all the points.
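The steps above can be sketched with NumPy. The point cloud here is hypothetical, generated with the same covariance as in the notes ([[1, -0.8], [-0.8, 1]]):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical point cloud, one point per row.
X = rng.multivariate_normal([3.0, -1.0], [[1.0, -0.8], [-0.8, 1.0]], size=2000)

mu = X.mean(axis=0)                # 1. center the cloud
Xc = X - mu
cov = Xc.T @ Xc / len(Xc)          # 2. covariance matrix
vals, V = np.linalg.eigh(cov)      # 3. eigenvectors as columns of V
Y = Xc @ V                         # 4. V^T maps the eigenvectors onto the axes
V_swapped = V[:, ::-1]             # 5. swap the two principal components
Z = Y @ V_swapped.T + mu           # 6.-7. rotate back, re-add the mean

cov_Z = np.cov(Z.T, bias=True)
# The off-diagonal (covariance) flips sign; the determinant is unchanged.
assert np.isclose(cov_Z[0, 1], -cov[0, 1])
assert np.isclose(np.linalg.det(cov_Z), np.linalg.det(cov))
```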

What happens to the covariance matrix

The variance stays the same but the covariance changes sign:

[[ 1.0 -0.8]
 [-0.8  1.0]]

becomes:

[[1.0 0.8]
 [0.8 1.0]]

What happens to the determinant

The absolute value of the determinant doesn't change at all because the space doesn't get stretched or shrunk, just rotated/reflected.

Increase the PCs by 5%

We need to apply a stretching transformation after step 4.
In order to do that, we change the basis vectors to ones that are 5% longer.
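Once the cloud is expressed in the eigenbasis (after step 4), the 5% stretch is just a diagonal scaling. A minimal sketch, with a hypothetical 2D case:

```python
import numpy as np

S = np.diag([1.05, 1.05])       # stretch every principal component by 5%
Y = np.array([[1.0, 2.0],       # points already rotated into the eigenbasis
              [-3.0, 0.5]])
Y_stretched = Y @ S.T           # each coordinate grows by 5%
assert np.allclose(Y_stretched, 1.05 * Y)
```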


Explain why it's bad to use PCA in this case

The classifier won't work because we are projecting the data in a dimension where the classifier can't tell the difference between the two classes.

These will be the 2 principal components.

If we project all data on the first PC, we get something like this:

You can't distinguish classes in this scatterplot.

What would make the classifier work?

Killing the first principal component and projecting everything on the second.


Exercise 2


Danger

It would be best to go back and study the specifics of the Expectation Maximization algorithm.

Write the density in equation form

Since he doesn't actually want us to compute the results (I think), we just write the equations (not the Gaussian one).

The following formula gives the density of the GMM at a point x, which basically means the total height of all the Gaussians summed at that point.
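The standard GMM density, with mixing coefficients that sum to 1:

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad
\sum_{k=1}^{K} \pi_k = 1
```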


Why the determinant in the denominator?

Note

The determinant of a matrix encodes the change in volume that results from a transformation.

In the Gaussian pdf, we have:

  • Σ⁻¹ in the exponent, which changes the shape of the curve
  • |Σ|^(1/2) in the denominator, which keeps the total volume equal to 1.

The |Σ|^(1/2) in the denominator is there to counteract the change in volume that the Σ⁻¹ in the exponent causes (because the total volume of a pdf must be 1).
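For reference, the multivariate Gaussian pdf in d dimensions:

```latex
\mathcal{N}(x \mid \mu, \Sigma)
= \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
  \exp\!\Big(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\Big)
```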

Why is μ not in the denominator?

In Masi's words:

Quote

The location parameter is just translating the distribution in space somewhere thus does not affect the volume.


Write the equation that computes the responsibilities

The exercise is telling us that at the given point, every Gaussian has the same height; they just have different mixing coefficients.

So the responsibilities are just the mixing coefficients.
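The general formula makes this clear: when all the component densities have the same height N̄ at the point, it reduces to the mixing coefficients (since they sum to 1).

```latex
\gamma_k(x)
= \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}
       {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}
= \frac{\pi_k \, \bar{N}}{\bar{N} \sum_{j} \pi_j}
= \pi_k
```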


Generate 10K samples from the GMM and reproject them

In this case, after PCA we are left with 10k images in the form of 50-dimensional vectors with values in [0, 255].

But Masi also gives us the mean of the original images and the U matrix used to compress them.

Hint

This is useful because we can reverse the process and get a compressed version of the original images.

If we want to generate those samples using a GMM, repeat the following process 10K times:

  1. By using inverse transform sampling on the cumulative mass function of the mixing coefficients, we select which digit we want to generate.
Hint

This is because the gaussians don't have all the same size, so some are more probable to be picked or more present in the GMM.

To solve this, we do a weighted random pick of the gaussian.

  2. We sample from the selected gaussian.
  3. We reproject the sampled image from the subspace to its original dimensions by multiplying it by U and adding the mean.
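The whole pipeline can be sketched as follows. All the concrete numbers here (10 components, 50-dimensional subspace, 784 original pixels, uniform mixing coefficients, diagonal covariances) are placeholder assumptions, not values from the exercise:

```python
import numpy as np

rng = np.random.default_rng(0)

K, d, D = 10, 50, 784                 # components, subspace dim, original dim (assumed)
pis = np.full(K, 1.0 / K)             # mixing coefficients (assumed uniform)
mus = rng.normal(size=(K, d))         # component means (placeholders)
sigmas = np.ones((K, d))              # per-dimension std devs (diagonal covariances)
U = rng.normal(size=(D, d))           # compression matrix from the exercise
mean_img = rng.normal(size=D)         # mean of the original images

n = 10_000
# 1. inverse transform sampling on the CMF of the mixing coefficients
ks = np.searchsorted(np.cumsum(pis), rng.uniform(size=n))
ks = np.minimum(ks, K - 1)            # guard against float rounding in the cumsum
# 2. sample from the selected (diagonal) Gaussians
z = mus[ks] + sigmas[ks] * rng.normal(size=(n, d))
# 3. reproject into the original space: x = U z + mean
X = z @ U.T + mean_img
assert X.shape == (n, D)
```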


Exercise 3

Danger

Go back and study Gini impurity and Misclassification!

Entropy of tree

The entropy of the tree is given by the entropy of the leaves, each weighted by the fraction of elements it contains:
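In formula form (N_l elements in leaf l, N elements total, p_{c,l} the class proportions inside the leaf):

```latex
H_{\text{tree}} = \sum_{\ell \in \text{leaves}} \frac{N_\ell}{N}\, H(\ell),
\qquad
H(\ell) = -\sum_{c} p_{c,\ell} \log_2 p_{c,\ell}
```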


Inference on a sample

Masi gives us the sample [-1, -1000].

We know that:

  1. If then we go to leaf .
  2. Then if then we go to leaf .

Since there is equal probability that the class is square and triangle, it's best to say that the algorithm is unsure here.


How to reduce overfitting in a decision tree

We cannot use the following strategies because:

  • Minimizing the leaf size: Would change the depth of the tree
  • Capping the maximum depth: Would change the depth of the tree
  • Conditional leaf generation(A node will be split if this split induces a decrease of the impurity greater than or equal to a value): Would change the depth of the tree

So we use Bagging and Ensemble, a technique that consists of drawing several bootstrap samples of the dataset and training a decision tree on each of them.

Then, at inference time, we average the outputs of the trees in the random forest, which gives us a model with reduced variance.


How to make random weighed prediction based on the elements in the leaves

We apply inverse transform sampling to the class proportions of the elements in the leaf (not fully sure about this).
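A minimal sketch of that idea, assuming a hypothetical leaf with 2 squares and 2 triangles (so the classes are equiprobable, as in the inference question above):

```python
import numpy as np

rng = np.random.default_rng(0)

counts = np.array([2, 2])              # elements per class in the leaf
probs = counts / counts.sum()          # class proportions

# Inverse transform sampling on the CMF of the proportions.
cmf = np.cumsum(probs)
u = rng.uniform(size=10_000)
predictions = np.searchsorted(cmf, u)

# Each class should be predicted roughly in proportion to its frequency.
freq = np.bincount(predictions, minlength=2) / len(predictions)
assert np.allclose(freq, probs, atol=0.02)
```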


Exercise 4

What are TPR and FPR?

The true positive rate (TPR), also referred to as sensitivity or recall, measures the fraction of actual positives that are correctly identified.

The false positive rate (FPR) is the analogous quantity for the negatives: the fraction of actual negatives that are incorrectly classified as positive.

What is the ROC curve and how do we compute it?

The ROC curve is a performance measurement for classification problems at various threshold settings.
It plots TPR and FPR at various threshold values.
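A sketch of the sweep, with hypothetical scores and labels (AUC is then the area under the traced curve, here via the trapezoidal rule):

```python
import numpy as np

labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.35, 0.8, 0.4, 0.6, 0.7, 0.9])

# One threshold per distinct score, from "predict nothing" downwards.
thresholds = np.r_[np.inf, np.sort(scores)[::-1]]
tpr = np.array([((scores >= t) & (labels == 1)).sum() / (labels == 1).sum()
                for t in thresholds])
fpr = np.array([((scores >= t) & (labels == 0)).sum() / (labels == 0).sum()
                for t in thresholds])

# AUC: trapezoidal integration of TPR over FPR.
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
assert 0.0 <= auc <= 1.0
```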


Exercise 5

How many parameters?

Layer 1) We have 2048 weights for each neuron of the first layer, so:

BUT THIS LAYER IS NOT TRAINABLE, SO IT DOESN'T COUNT.

Layer 2) We have 1024 weights for each neuron of the second layer, so:

Layer 3) We have 512 weights for each neuron of the third layer, so:

When we sum them all together, we get the total number of parameters:
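A hedged sketch of the counting. The per-neuron weight counts above imply widths 2048 → 1024 → 512 → (output); the output size (10 here) and the bias-free counting are assumptions for illustration, not values from the exercise:

```python
def dense_params(fan_in, fan_out, bias=False):
    """Weights (and optional biases) of one fully connected layer."""
    return fan_in * fan_out + (fan_out if bias else 0)

widths = [2048, 1024, 512, 10]        # hypothetical architecture
per_layer = [dense_params(i, o) for i, o in zip(widths, widths[1:])]
# Layer 1 is frozen in the exercise, so it is excluded from the trainable count.
trainable = sum(per_layer[1:])
```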


wtf?


Draw the DAG

Compute the gradient using the chain rule

At the last gate, we compute the derivatives with respect to each component of :

Now, in order to apply the chain rule, we would need to know the derivatives of and .

The only remaining thing is .

Let's kill a fucking gate.


Useful things to remember

ROC and AUC

ROC is computed from TPR and FPR: TPR = TP / (TP + FN) and FPR = FP / (FP + TN).

The AUC is computed by integrating the area under the ROC curve (e.g. with the trapezoidal rule):

Useful local gradients

Sigmoid: dσ/dx = σ(x)(1 − σ(x))

Softmax + Cross-entropy: ∂L/∂z = p − y (with y one-hot)

Product gate: routes the gradient to each input, multiplied by the other input

Sum gate: distributes the gradient unchanged to all inputs

+1 gate: passes the gradient through unchanged

gate:
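The sigmoid local gradient listed above can be sanity-checked numerically with central differences, a useful trick for any gate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.3
analytic = sigmoid(x) * (1.0 - sigmoid(x))   # local gradient of the sigmoid gate

# Numerical check with central differences.
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
assert np.isclose(analytic, numeric, atol=1e-8)
```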